Image stitching is the process of combining multiple images to create a panoramic or wide-angle image.
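As a concrete starting point, a minimal sketch of this pipeline using OpenCV's high-level Stitcher API is shown below; the input file names are placeholders rather than part of any specific method described here.

```python
# Minimal image stitching sketch using OpenCV's high-level Stitcher.
# File names are illustrative placeholders.
import cv2

images = [cv2.imread(p) for p in ["left.jpg", "middle.jpg", "right.jpg"]]

stitcher = cv2.Stitcher_create(cv2.Stitcher_PANORAMA)
status, panorama = stitcher.stitch(images)

if status == cv2.Stitcher_OK:
    cv2.imwrite("panorama.jpg", panorama)
else:
    print(f"Stitching failed with status code {status}")
```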
Sewing patterns define the structural foundation of garments and are essential for applications such as fashion design, fabrication, and physical simulation. Despite progress in automated pattern generation, accurately modeling sewing patterns remains difficult due to the broad variability in panel geometry and seam arrangements. In this work, we introduce a sewing pattern modeling method based on an implicit representation. We represent each panel using a signed distance field that defines its boundary and an unsigned distance field that identifies seam endpoints, and encode these fields into a continuous latent space that enables differentiable meshing. A latent flow matching model learns distributions over panel combinations in this representation, and a stitching prediction module recovers seam relations from extracted edge segments. This formulation allows accurate modeling and generation of sewing patterns with complex structures. We further show that it can be used to estimate sewing patterns from images with improved accuracy relative to existing approaches, and supports applications such as pattern completion and refitting, providing a practical tool for digital fashion design.
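As a rough illustration of the panel representation described above (not the paper's actual implementation), the sketch below rasterizes a single hypothetical quadrilateral panel into a signed distance field for its boundary and an unsigned distance field for its seam endpoints; the example geometry and grid resolution are assumptions.

```python
# Hedged sketch: one panel as a signed distance field (boundary) plus an
# unsigned distance field to seam endpoints. Panel shape, seam endpoints,
# and resolution are illustrative assumptions.
import numpy as np
from matplotlib.path import Path

def segment_distance(p, a, b):
    """Distance from points p (N, 2) to the segment a-b."""
    ab, ap = b - a, p - a
    t = np.clip((ap @ ab) / (ab @ ab), 0.0, 1.0)
    return np.linalg.norm(ap - t[:, None] * ab, axis=1)

def panel_fields(boundary, seam_endpoints, res=128):
    """boundary: (K, 2) closed polygon; seam_endpoints: (M, 2) points."""
    xs = np.linspace(-1, 1, res)
    grid = np.stack(np.meshgrid(xs, xs), axis=-1).reshape(-1, 2)

    # Unsigned distance to the polygon boundary.
    d = np.min(
        [segment_distance(grid, boundary[i], boundary[(i + 1) % len(boundary)])
         for i in range(len(boundary))], axis=0)

    # Sign: negative inside the panel, positive outside.
    inside = Path(boundary).contains_points(grid)
    sdf = np.where(inside, -d, d).reshape(res, res)

    # Unsigned distance to the seam endpoints.
    udf = np.min(np.linalg.norm(grid[:, None, :] - seam_endpoints[None], axis=-1),
                 axis=1).reshape(res, res)
    return sdf, udf

panel = np.array([[-0.5, -0.7], [0.5, -0.7], [0.4, 0.7], [-0.4, 0.7]])
endpoints = np.array([[-0.5, -0.7], [0.5, -0.7]])
sdf, udf = panel_fields(panel, endpoints)
```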
Brain foundation models bring the foundation model paradigm to the field of neuroscience. Like language and image foundation models, they are general-purpose AI systems pretrained on large-scale datasets that adapt readily to downstream tasks. Unlike text- and image-based models, however, they train on brain data: large datasets of EEG, fMRI, and other neural data types historically collected within tightly governed clinical and research settings. This paper contends that training foundation models on neural data opens new normative territory. Neural data carry stronger expectations of, and claims to, protection than text or images, given their body-derived nature and historical governance within clinical and research settings. Yet the foundation model paradigm subjects them to practices of large-scale repurposing, cross-context stitching, and open-ended downstream application. Furthermore, these practices are now accessible to a much broader range of actors, including commercial developers, against a backdrop of fragmented and unclear governance. To map this territory, we first describe brain foundation models' technical foundations and training-data ecosystem. We then draw on AI ethics, neuroethics, and bioethics to organize concerns across privacy, consent, bias, benefit sharing, and governance. For each, we propose both agenda-setting questions and baseline safeguards as the field matures.
Artificial Intelligence-Generated Content (AIGC) has made significant strides, with high-resolution text-to-image (T2I) generation becoming increasingly critical for improving users' Quality of Experience (QoE). Although resource-constrained edge computing adequately supports fast low-resolution T2I generation, producing high-resolution output still requires trading image fidelity against latency. To address this, we first investigate the performance of super-resolution (SR) methods for image enhancement, confirming a fundamental trade-off: lightweight learning-based SR struggles to recover fine details, while diffusion-based SR achieves higher fidelity at a substantial computational cost. Motivated by these observations, we propose an end-edge collaborative generation-enhancement framework. Upon receiving a T2I generation task, the system first generates a low-resolution image at the edge side, using adaptively selected denoising steps and super-resolution scales; the image is then partitioned into patches and processed by a region-aware hybrid SR policy. This policy applies a diffusion-based SR model to foreground patches for detail recovery and a lightweight learning-based SR model to background patches for efficient upscaling, ultimately stitching the enhanced patches into the high-resolution image. Experiments show that our system reduces service latency by 33% compared with baselines while maintaining competitive image quality.
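A minimal sketch of the region-aware hybrid SR dispatch might look as follows; `diffusion_sr` and `light_sr` are hypothetical stand-ins for the two SR models (each mapping a patch to an upscaled patch), and the patch size, scale, and foreground threshold are illustrative assumptions rather than the paper's settings.

```python
# Hedged sketch of region-aware hybrid super-resolution dispatch.
# `diffusion_sr` and `light_sr` are hypothetical callables: (patch, scale) -> upscaled patch.
import numpy as np

def hybrid_upscale(lowres, fg_mask, diffusion_sr, light_sr, patch=64, scale=4):
    h, w = lowres.shape[:2]
    out = np.zeros((h * scale, w * scale, 3), dtype=lowres.dtype)
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            tile = lowres[y:y + patch, x:x + patch]
            # Route foreground tiles to the heavy diffusion SR model,
            # background tiles to the lightweight learned SR model.
            is_fg = fg_mask[y:y + patch, x:x + patch].mean() > 0.5
            up = diffusion_sr(tile, scale) if is_fg else light_sr(tile, scale)
            out[y * scale:(y + patch) * scale, x * scale:(x + patch) * scale] = up
    return out
```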
High-fidelity 3D tooth models are essential for digital dentistry, but must capture both the detailed crown and the complete root. Clinical imaging modalities are limited: Cone-Beam Computed Tomography (CBCT) captures the root but has a noisy, low-resolution crown, while Intraoral Scanners (IOS) provide a high-fidelity crown but no root information. A naive fusion of these sources results in unnatural seams and artifacts. We propose a novel, fully-automated pipeline that fuses CBCT and IOS data using a deep implicit representation. Our method first segments and robustly registers the tooth instances, then creates a hybrid proxy mesh combining the IOS crown and the CBCT root. The core of our approach is to use this noisy proxy to guide a class-specific DeepSDF network. This optimization process projects the input onto a learned manifold of ideal tooth shapes, generating a seamless, watertight, and anatomically coherent model. Qualitative and quantitative evaluations show our method uniquely preserves both the high-fidelity crown from IOS and the patient-specific root morphology from CBCT, overcoming the limitations of each modality and naive stitching.
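The projection onto a learned shape manifold can be sketched as DeepSDF-style auto-decoding: a latent code is optimized so a pretrained class-specific decoder reproduces SDF samples drawn from the noisy proxy. The `decoder` callable, sampling inputs, and hyperparameters below are assumptions for illustration, not the paper's implementation.

```python
# Hedged sketch of fitting a learned shape prior to the noisy proxy.
# `decoder` is a hypothetical pretrained network: (latent, xyz) -> sdf;
# `proxy_pts` / `proxy_sdf` stand for SDF samples around the proxy mesh.
import torch

def fit_latent(decoder, proxy_pts, proxy_sdf, latent_dim=256, steps=500):
    latent = torch.zeros(1, latent_dim, requires_grad=True)
    opt = torch.optim.Adam([latent], lr=5e-3)
    for _ in range(steps):
        opt.zero_grad()
        pred = decoder(latent.expand(proxy_pts.shape[0], -1), proxy_pts)
        # Clamped L1 data term plus a Gaussian prior on the code, so the
        # fit stays on the learned manifold of ideal tooth shapes.
        loss = torch.nn.functional.l1_loss(pred.clamp(-0.1, 0.1),
                                           proxy_sdf.clamp(-0.1, 0.1))
        loss = loss + 1e-4 * latent.pow(2).sum()
        loss.backward()
        opt.step()
    return latent.detach()
```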
This paper introduces SENA (SEamlessly NAtural), a geometry-driven image stitching approach that prioritizes structural fidelity in challenging real-world scenes characterized by parallax and depth variation. Conventional image stitching relies on homographic alignment, but this rigid planar assumption often fails in dual-camera setups with significant scene depth, leading to distortions such as visible warps and spherical bulging. SENA addresses these fundamental limitations through three key contributions. First, we propose a hierarchical affine-based warping strategy, combining global affine initialization with local affine refinement and smooth free-form deformation. This design preserves local shape, parallelism, and aspect ratios, thereby avoiding the hallucinated structural distortions commonly introduced by homography-based models. Second, we introduce a geometry-driven adequate zone detection mechanism that identifies parallax-minimized regions directly from the disparity consistency of RANSAC-filtered feature correspondences, without relying on semantic segmentation. Third, building upon this adequate zone, we perform anchor-based seamline cutting and segmentation, enforcing a one-to-one geometric correspondence across image pairs by construction, which effectively eliminates ghosting, duplication, and smearing artifacts in the final panorama. Extensive experiments conducted on challenging datasets demonstrate that SENA achieves alignment accuracy comparable to leading homography-based methods, while significantly outperforming them in critical visual metrics such as shape preservation, texture integrity, and overall visual realism.
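As an illustrative sketch of the global affine initialization (the feature detector, ratio test, and thresholds are assumptions, not SENA's exact choices), one could estimate a 2x3 affine from RANSAC-filtered correspondences rather than a full homography:

```python
# Hedged sketch: global affine estimation from RANSAC-filtered matches.
# SIFT + BF ratio test and the thresholds are illustrative assumptions.
import cv2
import numpy as np

def global_affine(img_a, img_b):
    sift = cv2.SIFT_create()
    ka, da = sift.detectAndCompute(img_a, None)
    kb, db = sift.detectAndCompute(img_b, None)
    pairs = cv2.BFMatcher().knnMatch(da, db, k=2)
    good = [p[0] for p in pairs
            if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]

    src = np.float32([ka[m.queryIdx].pt for m in good])
    dst = np.float32([kb[m.trainIdx].pt for m in good])
    # An affine (6 DoF) avoids the projective bulging a homography can introduce.
    A, inliers = cv2.estimateAffine2D(src, dst, method=cv2.RANSAC,
                                      ransacReprojThreshold=3.0)
    keep = inliers.ravel() == 1
    return A, src[keep], dst[keep]
```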
Traditional image stitching techniques have predominantly utilized two-dimensional homography transformations and mesh warping to achieve alignment on a planar surface. While effective for scenes that are approximately coplanar or exhibit minimal parallax, these approaches often result in ghosting, structural bending, and stretching distortions in non-overlapping regions when applied to real three-dimensional scenes characterized by multiple depth layers and occlusions. Such challenges are exacerbated in multi-view accumulations and 360° closed-loop stitching scenarios. In response, this study introduces a spatially lifted panoramic stitching framework that initially elevates each input image into a dense three-dimensional point representation within a unified coordinate system, facilitating global cross-view fusion augmented by confidence metrics. Subsequently, a unified projection center is established in three-dimensional space, and an equidistant cylindrical projection is employed to map the fused data onto a single panoramic manifold, thereby producing a geometrically consistent 360° panoramic layout. Finally, hole filling is conducted within the canvas domain to address unknown regions revealed by viewpoint transitions, restoring continuous texture and semantic coherence. This framework reconceptualizes stitching from a two-dimensional warping paradigm to a three-dimensional consistency paradigm and is designed to flexibly incorporate various three-dimensional lifting and completion modules. Experimental evaluations demonstrate that the proposed method substantially mitigates geometric distortions and ghosting artifacts in scenarios involving significant parallax and complex occlusions, yielding panoramic results that are more natural and consistent.
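A minimal sketch of the projection stage is given below: fused, colored 3D points are mapped from a chosen projection center onto an equidistant cylindrical (equirectangular) canvas with a simple z-buffer splat. The canvas size, axis conventions, and splatting strategy are illustrative assumptions, not the framework's actual modules.

```python
# Hedged sketch: splat colored 3D points onto an equidistant cylindrical panorama.
# Axis conventions (y up/down, z forward) and canvas size are assumptions.
import numpy as np

def project_to_panorama(points, colors, center, height=1024, width=2048):
    rel = points - center                              # rays from the projection center
    r = np.linalg.norm(rel, axis=1)
    lon = np.arctan2(rel[:, 0], rel[:, 2])             # longitude in [-pi, pi]
    lat = np.arcsin(rel[:, 1] / np.maximum(r, 1e-8))   # latitude in [-pi/2, pi/2]

    u = ((lon / np.pi + 1.0) * 0.5 * (width - 1)).astype(int)
    v = ((lat / (np.pi / 2) + 1.0) * 0.5 * (height - 1)).astype(int)

    canvas = np.zeros((height, width, 3), dtype=np.uint8)
    zbuf = np.full((height, width), np.inf)
    for i in range(len(points)):                       # simple z-buffer splat
        if r[i] < zbuf[v[i], u[i]]:
            zbuf[v[i], u[i]] = r[i]
            canvas[v[i], u[i]] = colors[i]
    return canvas
```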
Recent pose-to-video models can translate 2D pose sequences into photorealistic, identity-preserving dance videos, so the key challenge is to generate temporally coherent, rhythm-aligned 2D poses from music, especially under complex, high-variance in-the-wild distributions. We address this by reframing music-to-dance generation as a music-token-conditioned multi-channel image synthesis problem: 2D pose sequences are encoded as one-hot images, compressed by a pretrained image VAE, and modeled with a DiT-style backbone, allowing us to inherit architectural and training advances from modern text-to-image models and better capture high-variance 2D pose distributions. On top of this formulation, we introduce (i) a time-shared temporal indexing scheme that explicitly synchronizes music tokens and pose latents over time and (ii) a reference-pose conditioning strategy that preserves subject-specific body proportions and on-screen scale while enabling long-horizon segment-and-stitch generation. Experiments on a large in-the-wild 2D dance corpus and the calibrated AIST++2D benchmark show consistent improvements over representative music-to-dance methods in pose- and video-space metrics and human preference, and ablations validate the contributions of the representation, temporal indexing, and reference conditioning. See supplementary videos at https://hot-dance.github.io
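The pose-as-image formulation can be sketched roughly as follows: each frame of a normalized 2D pose sequence becomes a J-channel image with one hot pixel per joint, ready to be compressed by an image VAE. The grid resolution and coordinate convention are assumptions rather than the paper's exact recipe.

```python
# Hedged sketch: encode a 2D pose sequence as per-frame one-hot joint images.
import numpy as np

def poses_to_onehot_images(poses, res=64):
    """poses: (T, J, 2) joint coordinates normalized to [0, 1]."""
    T, J, _ = poses.shape
    images = np.zeros((T, J, res, res), dtype=np.float32)
    idx = np.clip((poses * (res - 1)).round().astype(int), 0, res - 1)
    for t in range(T):
        for j in range(J):
            x, y = idx[t, j]
            images[t, j, y, x] = 1.0               # one hot pixel per joint channel
    return images

# Example: 16 frames, 17 COCO-style joints.
demo = poses_to_onehot_images(np.random.rand(16, 17, 2))
```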
Unmanned Aerial Vehicles (UAVs) are widely used for aerial photography and remote sensing applications. One of the main challenges is to stitch together multiple images into a single high-resolution image that covers a large area. Feature-based image stitching algorithms are commonly used but can suffer from errors and ambiguities in feature detection and matching. To address this, several approaches have been proposed, including using bundle adjustment techniques or direct image alignment. In this paper, we present a novel method that uses a combination of IMU data and computer vision techniques for stitching images captured by a UAV. Our method involves several steps: estimating the displacement and rotation of the UAV between consecutive images, correcting for perspective distortion, and computing a homography matrix. We then use a standard image stitching algorithm to align and blend the images together. Our proposed method leverages the additional information provided by the IMU data, corrects for various sources of distortion, and can be easily integrated into existing UAV workflows. Our experiments demonstrate the effectiveness and robustness of our method, outperforming some of the existing feature-based image stitching algorithms in terms of accuracy and reliability, particularly in challenging scenarios such as large displacements, rotations, and variations in camera pose.
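One way to sketch the IMU-to-homography step, under the common flat-ground assumption H = K (R - t n^T / d) K^{-1}, is shown below; the intrinsics, plane normal, and depth values are illustrative, not calibrated parameters from the paper.

```python
# Hedged sketch: initial homography from IMU-derived rotation and displacement,
# assuming a roughly planar ground scene. K, n, d are illustrative values.
import numpy as np

def imu_homography(K, R, t, n=np.array([0.0, 0.0, 1.0]), d=100.0):
    """R: 3x3 inter-frame rotation, t: 3-vector translation (meters),
    n: ground-plane normal in the first camera frame, d: distance to the plane."""
    H = K @ (R - np.outer(t, n) / d) @ np.linalg.inv(K)
    return H / H[2, 2]                                 # normalize so H[2, 2] = 1

K = np.array([[1200.0, 0.0, 960.0],
              [0.0, 1200.0, 540.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)                                          # e.g. level flight, no rotation
t = np.array([5.0, 0.0, 0.0])                          # 5 m displacement between frames
H = imu_homography(K, R, t)
```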
Image stitching often faces challenges due to varying capture angles, positional differences, and object movements, leading to misalignments and visual discrepancies. Traditional seam carving methods neglect semantic information, causing disruptions in foreground continuity. We introduce SemanticStitch, a deep learning-based framework that incorporates semantic priors of foreground objects to preserve their integrity and enhance visual coherence. Our approach includes a novel loss function that emphasizes the semantic integrity of salient objects, significantly improving stitching quality. We also present two specialized real-world datasets to evaluate our method's effectiveness. Experimental results demonstrate substantial improvements over traditional techniques, providing robust support for practical applications.
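The core idea can be sketched as a standard dynamic-programming seam search whose energy is biased by a semantic prior so the seam avoids salient foreground pixels; the penalty weight and energy definition below are illustrative assumptions, not SemanticStitch's actual loss.

```python
# Hedged sketch: semantic-aware seam search in the overlap region.
import numpy as np

def semantic_seam(diff, fg_mask, fg_penalty=1e3):
    """diff: (H, W) photometric difference in the overlap;
    fg_mask: (H, W) binary mask of salient foreground objects."""
    energy = diff + fg_penalty * fg_mask.astype(np.float32)
    H, W = energy.shape
    cost = energy.copy()
    for y in range(1, H):
        up = cost[y - 1]
        left = np.roll(up, 1)
        left[0] = np.inf
        right = np.roll(up, -1)
        right[-1] = np.inf
        cost[y] += np.minimum(np.minimum(left, up), right)

    # Backtrack the minimal-cost vertical seam from bottom to top.
    seam = np.zeros(H, dtype=int)
    seam[-1] = int(np.argmin(cost[-1]))
    for y in range(H - 2, -1, -1):
        x = seam[y + 1]
        lo, hi = max(0, x - 1), min(W, x + 2)
        seam[y] = lo + int(np.argmin(cost[y, lo:hi]))
    return seam
```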
Video face swapping is crucial in film and entertainment production, where achieving high fidelity and temporal consistency over long and complex video sequences remains a significant challenge. Inspired by recent advances in reference-guided image editing, we explore whether rich visual attributes from source videos can be similarly leveraged to enhance both fidelity and temporal coherence in video face swapping. Building on this insight, this work presents LivingSwap, the first video-reference-guided face swapping model. Our approach employs keyframes as conditioning signals to inject the target identity, enabling flexible and controllable editing. By combining keyframe conditioning with video reference guidance, the model performs temporal stitching to ensure stable identity preservation and high-fidelity reconstruction across long video sequences. To address the scarcity of data for reference-guided training, we construct a paired face-swapping dataset, Face2Face, and further reverse the data pairs to ensure reliable ground-truth supervision. Extensive experiments demonstrate that our method achieves state-of-the-art results, seamlessly integrating the target identity with the source video's expressions, lighting, and motion, while significantly reducing manual effort in production workflows. Project webpage: https://aim-uofa.github.io/LivingSwap
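Temporal stitching across long sequences can be sketched generically as overlapping segment generation with a cross-fade over the shared frames; `generate_segment` is a hypothetical stand-in for the model call, and the segment and overlap lengths (and float frame arrays) are assumptions rather than LivingSwap's actual mechanism.

```python
# Hedged sketch: segment-wise generation with cross-faded overlaps.
# `generate_segment(start, length)` is hypothetical and returns a
# (length, H, W, 3) float array of frames.
import numpy as np

def stitch_segments(generate_segment, total_frames, seg_len=48, overlap=8):
    video, start = None, 0
    while start < total_frames:
        seg = generate_segment(start, min(seg_len, total_frames - start))
        if video is None:
            video = seg
        else:
            # Linearly cross-fade the shared frames of consecutive segments.
            k = min(overlap, len(seg))
            w = np.linspace(0.0, 1.0, k)[:, None, None, None]
            video[-k:] = (1 - w) * video[-k:] + w * seg[:k]
            video = np.concatenate([video, seg[k:]], axis=0)
        start += seg_len - overlap
    return video
```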